A fundamental design decision in multilingual neural text-to-speech (NTTS) systems is how to represent input linguistic features within the model. Looking at the wide variety of approaches in the literature, two major paradigms emerge: unified and separate representations. The former uses a shared set of phonetic tokens across languages, whereas the latter uses unique phonetic tokens for each language. In this paper, we conduct a comprehensive study comparing multilingual NTTS models trained with both representations. Our results reveal that the unified approach consistently achieves better cross-lingual synthesis with respect to both naturalness and accent. Separate representations tend to have considerably more tokens than unified ones, which may affect model capacity. For this reason, we carry out an ablation study to understand the interplay between the representation type and the token embedding size. We find that the difference between the two paradigms only emerges above a certain threshold embedding size. This study provides strong evidence that unified representations should be the preferred paradigm when building multilingual NTTS systems.
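As a rough illustration of the two paradigms (the module names and the inventory size below are our own, not the paper's), a unified representation amounts to a single phoneme-embedding table shared across languages, while a separate representation allocates one table per language:

```python
import torch
import torch.nn as nn

# Hypothetical sketch contrasting the two paradigms. A unified representation
# shares one phoneme-token embedding table across languages; a separate
# representation allocates a distinct table per language, inflating the total
# token count by roughly the number of languages.

PHONEME_SET = 100   # assumed size of a shared (e.g., IPA-style) inventory
LANGUAGES = ["en", "es", "it", "de"]
EMBED_DIM = 256     # the embedding size varied in the ablation

# Unified: one table, tokens shared across all languages.
unified_embedding = nn.Embedding(PHONEME_SET, EMBED_DIM)

# Separate: one table per language.
separate_embeddings = nn.ModuleDict(
    {lang: nn.Embedding(PHONEME_SET, EMBED_DIM) for lang in LANGUAGES}
)

tokens = torch.tensor([3, 17, 42])           # a toy phoneme-ID sequence
shared = unified_embedding(tokens)           # same vectors for every language
italian = separate_embeddings["it"](tokens)  # language-specific vectors
print(shared.shape, italian.shape)           # both: torch.Size([3, 256])
```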
Training multilingual neural text-to-speech (NTTS) models using only monolingual corpora has become a popular way to build voice-cloning-based polyglot NTTS systems. To train these models well, it is essential to understand how the composition of the training corpus affects the quality of multilingual speech synthesis. In this context, one often hears questions such as "Would including more Spanish data help my Italian synthesis, given the proximity of the two languages?". Unfortunately, we find that the existing literature on the topic is far from complete. In the present work, we conduct an extensive ablation study aimed at understanding how various factors of the training corpus, such as language family affiliation, gender composition, and the number of speakers, contribute to the quality of polyglot synthesis. Our findings include the observation that female speaker data is preferred in most cases, and that having more speakers from the target language in the training corpus is not always beneficial. The findings herein are informative for the processes of data procurement and corpus building.
We introduce a novel automatic evaluation method for speaker similarity assessment that is consistent with human perceptual scores. Modern neural text-to-speech models require large amounts of clean training data, which is why many solutions have switched from single-speaker models to models trained on examples from many different speakers. Multi-speaker models bring new possibilities, such as a faster creation of new voices, but also a new problem: speaker leakage, where the speaker identity of a synthesized example may not match that of the target speaker. Currently, the only way to discover this issue is through costly perceptual evaluations. In this work, we propose an automatic method for the assessment of speaker similarity. For that purpose, we extend recent work on speaker verification systems and evaluate how different metrics and speaker embedding models reflect Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) scores. Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with an accuracy of 0.96, and with a Pearson score of up to 0.78 at the utterance level.
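As a hedged sketch of how such an automatic assessment could be wired up (the embedding source, feature choice, and regressor below are illustrative assumptions, not the paper's exact pipeline), one can map pairs of speaker embeddings to perceptual scores and check the Pearson agreement:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative sketch: predict perceptual speaker-similarity (MUSHRA-style)
# scores from pairs of speaker embeddings. The embeddings here are random
# stand-ins for the output of a speaker-verification encoder.

rng = np.random.default_rng(0)
n_pairs, dim = 200, 192
synth_emb = rng.normal(size=(n_pairs, dim))   # embeddings of synthesized audio
target_emb = rng.normal(size=(n_pairs, dim))  # embeddings of target speakers
mushra = rng.uniform(0, 100, size=n_pairs)    # placeholder listener scores

def cosine(a, b):
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Use embedding similarity as a feature, fit a simple regressor to the
# perceptual scores, and measure agreement with Pearson correlation.
feature = cosine(synth_emb, target_emb).reshape(-1, 1)
model = LinearRegression().fit(feature, mushra)
pred = model.predict(feature)
pearson = np.corrcoef(pred, mushra)[0, 1]
print(f"Pearson correlation with listener scores: {pearson:.2f}")
```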
In this work, we address the problem of unsupervised moving object segmentation (MOS) in 4D LiDAR data recorded from a stationary sensor, where no ground truth annotations are involved. Deep learning-based state-of-the-art methods for LiDAR MOS depend strongly on annotated ground truth data, which is expensive to obtain and scarce. To close this gap in the stationary setting, we propose a novel 4D LiDAR representation based on multivariate time series that relaxes the problem of unsupervised MOS to a time series clustering problem. More specifically, we propose modeling the change in occupancy of a voxel by a multivariate occupancy time series (MOTS), which captures spatio-temporal occupancy changes at the voxel level and in its surrounding neighborhood. To perform unsupervised MOS, we train a neural network in a self-supervised manner to encode MOTS into voxel-level feature representations, which a clustering algorithm can then partition into moving and stationary. Experiments on stationary scenes from the Raw KITTI dataset show that our fully unsupervised approach achieves performance comparable to that of supervised state-of-the-art approaches.
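A schematic of the MOTS construction might look as follows; the voxel grid, the neighbourhood selection, and the use of raw series with k-means (in place of the paper's self-supervised encoder) are simplifying assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Schematic sketch of the MOTS idea (shapes and names are illustrative):
# for each voxel, record a binary occupancy signal over T scans, append the
# signals of its spatial neighbours to form a multivariate time series, and
# cluster the per-voxel features into moving vs. stationary.

T, n_voxels, n_neighbors = 50, 1000, 6
rng = np.random.default_rng(1)

# occupancy[t, v] = 1 if voxel v contains LiDAR returns at scan t.
occupancy = (rng.random((T, n_voxels)) > 0.7).astype(np.float32)

features = []
for v in range(n_voxels):
    neighbours = rng.choice(n_voxels, n_neighbors, replace=False)  # stand-in
    series = np.concatenate([occupancy[:, v], occupancy[:, neighbours].ravel()])
    features.append(series)

# The paper first encodes MOTS with a self-supervised network; here we
# cluster the raw series directly just to show the shape of the pipeline.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.stack(features))
```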
Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulties associated with collecting enough in-domain data might discourage researchers from approaching this pre-training task. In this paper, we conducted a series of experiments by pre-training Bidirectional Encoder Representations from Transformers (BERT) with different sizes of biomedical corpora. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with limited training steps can lead to better performance on downstream domain-specific NLP tasks than fine-tuning models pre-trained on general corpora.
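A minimal sketch of such continued in-domain pre-training with the Hugging Face Transformers API is shown below; the corpus file, step count, and batch size are placeholders rather than the paper's settings:

```python
# Continued masked-language-model pre-training on an in-domain corpus.
# "biomedical_corpus.txt" is an assumed plain-text file, one document per line.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = load_dataset("text", data_files={"train": "biomedical_corpus.txt"})
tokenized = dataset["train"].map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-biomed", max_steps=10_000,
                           per_device_train_batch_size=32),
    train_dataset=tokenized,
    # Random 15% token masking, the standard BERT pre-training objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```

The resulting checkpoint would then be fine-tuned on the downstream domain-specific task as usual.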
In this paper, we explore the relationship between an individual's writing style and the risk that they will engage in online harmful behaviors (such as cyberbullying). In particular, we consider whether measurable differences in writing style relate to different personality types, as modeled by the Big-Five personality traits and the Dark Triad traits, and can differentiate between users who do or do not engage in harmful behaviors. We study messages from nearly 2,500 users of two online communities (Twitter and Reddit) and find that we can measure significant personality differences between regular and harmful users from the writing style of as few as 100 tweets or 40 Reddit posts, aggregate these values to distinguish between healthy and harmful communities, and use style attributes to predict which users will engage in harmful behaviors.
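As an illustrative (not the authors') stylometric pipeline, one could aggregate each user's messages, extract style-oriented features such as character n-grams, and fit a classifier:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stylometry sketch; the feature and model choices are ours, and the
# two-user "dataset" below is purely for demonstration.

users = {
    "user_a": ["i think this is fine", "thanks, that helps a lot"],
    "user_b": ["you are all idiots", "nobody here knows anything"],
}
labels = np.array([0, 1])  # 0 = regular, 1 = harmful (toy labels)

# Character n-grams capture punctuation and word-form habits that tend to
# persist across topics, a common stylometric choice.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
docs = [" ".join(msgs) for msgs in users.values()]  # aggregate per user
X = vectorizer.fit_transform(docs)

clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```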
3D autonomous driving semantic segmentation using deep learning has become a well-studied subject, with methods that can reach very high performance. Nonetheless, because of the limited size of the training datasets, these models cannot see every type of object and scene found in real-world applications. The ability to be reliable in these various unknown environments is called domain generalization. Despite its importance, domain generalization remains relatively unexplored for 3D autonomous driving semantic segmentation. To fill this gap, this paper presents the first benchmark for this application, testing state-of-the-art methods and discussing the difficulty of tackling LiDAR domain shifts. We also propose the first method designed to address this domain generalization, which we call 3DLabelProp. The method leverages the geometry and sequentiality of LiDAR data to enhance generalization performance by working on partially accumulated point clouds. It reaches a mIoU of 52.6% on SemanticPOSS while being trained only on SemanticKITTI, making it the state-of-the-art method for generalization (+7.4% over the second best method). The code for this method will be available on GitHub.
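The partial accumulation that 3DLabelProp builds on can be sketched as follows, assuming known sensor poses; the scans and poses here are synthetic stand-ins:

```python
import numpy as np

# Schematic point-cloud accumulation: transform each of the last k scans
# into a common world frame using its 4x4 sensor pose, then concatenate
# them before segmentation, giving the model denser, sequence-aware geometry.

def transform(points, pose):
    """Apply a 4x4 homogeneous pose to an (N, 3) point array."""
    homog = np.hstack([points, np.ones((points.shape[0], 1))])
    return (homog @ pose.T)[:, :3]

k = 5
scans = [np.random.rand(1000, 3) for _ in range(k)]  # stand-in LiDAR scans
poses = [np.eye(4) for _ in range(k)]                # stand-in poses

accumulated = np.vstack([transform(s, p) for s, p in zip(scans, poses)])
print(accumulated.shape)  # (5000, 3)
```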
Quantitative cancer image analysis relies on the accurate delineation of tumours, a very specialised and time-consuming task. For this reason, methods for automated segmentation of tumours in medical imaging have been extensively developed in recent years, with Computed Tomography being one of the most popular imaging modalities explored. However, the large number of 3D voxels in a typical scan makes it prohibitive to analyse the entire volume at once on conventional hardware. To overcome this issue, downsampling and/or resampling are generally applied when using traditional convolutional neural networks in medical imaging. In this paper, we propose a new methodology that introduces sparsification of the input images together with submanifold sparse convolutional networks as an alternative to downsampling. As a proof of concept, we applied this methodology to Computed Tomography images of renal cancer patients, obtaining kidney and tumour segmentation performance competitive with previous methods (~84.6% Dice similarity coefficient) while achieving a significant improvement in computation time (2-3 min per training epoch).
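The sparsification step can be illustrated as below; the intensity threshold is an assumption for the sketch, and the resulting coordinate/feature lists match the input format expected by submanifold sparse convolution libraries such as MinkowskiEngine or spconv:

```python
import numpy as np

# Sparsify a dense CT volume: keep only voxels whose intensity suggests
# tissue of interest and store them as coordinate/feature lists. The
# threshold below is illustrative, not the paper's.

volume = np.random.randint(-1000, 1000, size=(256, 256, 128))  # toy CT (HU)

mask = volume > -200                 # assumed soft-tissue-like threshold
coords = np.argwhere(mask)           # (N, 3) active-voxel coordinates
feats = volume[mask].reshape(-1, 1)  # (N, 1) intensity features

density = coords.shape[0] / volume.size
print(f"active voxels: {coords.shape[0]} ({density:.1%} of the volume)")
```

Because submanifold sparse convolutions only compute outputs at active sites, the cost scales with the number of retained voxels rather than the full grid, which is what makes whole-volume training feasible.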
The accurate prediction of physicochemical properties of chemical compounds in mixtures (such as the activity coefficient at infinite dilution $\gamma_{ij}^\infty$) is essential for developing novel and more sustainable chemical processes. In this work, we analyze the performance of previously proposed GNN-based models for the prediction of $\gamma_{ij}^\infty$ and compare them with several mechanistic models in a series of 9 isothermal studies. Moreover, we develop the Gibbs-Helmholtz Graph Neural Network (GH-GNN) model for predicting $\ln \gamma_{ij}^\infty$ of molecular systems at different temperatures. Our method combines the simplicity of a Gibbs-Helmholtz-derived expression with a series of graph neural networks that incorporate explicit molecular and intermolecular descriptors to capture dispersion and hydrogen-bonding effects. We trained this model on experimentally determined $\ln \gamma_{ij}^\infty$ data for 40,219 binary systems involving 1032 solutes and 866 solvents, showing overall superior performance compared to the popular UNIFAC-Dortmund model. We analyze the performance of GH-GNN for continuous and discrete inter/extrapolation and give indications of the model's applicability domain and expected accuracy. In general, GH-GNN is able to produce accurate predictions for extrapolated binary systems if the training set contains at least 25 systems with the same combination of solute-solvent chemical classes and a similarity indicator above 0.35 is also present. This model and its applicability-domain recommendations have been made open-source at https://github.com/edgarsmdn/GH-GNN.
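Reading "Gibbs-Helmholtz-derived expression" at face value suggests a temperature dependence of the form $\ln \gamma_{ij}^\infty = A_{ij} + B_{ij}/T$, with the system-specific coefficients supplied by the graph neural networks; a sketch under that assumption (the coefficient values are placeholders, not outputs of the published model):

```python
# Gibbs-Helmholtz-style temperature dependence, assumed from the abstract:
# ln(gamma_ij^inf) = A_ij + B_ij / T, where A_ij and B_ij would come from
# GNNs over the solute and solvent molecular graphs.

def ln_gamma_inf(A_ij: float, B_ij: float, T: float) -> float:
    """Return ln(gamma^inf) for temperature T in kelvin."""
    return A_ij + B_ij / T

A, B = 1.2, -450.0  # hypothetical GNN outputs for one solute-solvent pair
for T in (298.15, 323.15, 348.15):
    print(f"T = {T:.2f} K -> ln(gamma^inf) = {ln_gamma_inf(A, B, T):.3f}")
```

One appeal of this design is that a single pair of predicted coefficients covers all temperatures for a given binary system, rather than requiring one model per isotherm.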
Recent advances in deep learning models for sequence classification have greatly improved their classification accuracy, especially when large training sets are available. However, several works have suggested that under some settings the predictions made by these models are poorly calibrated. In this work we study binary sequence classification problems and look at model calibration from a different perspective by asking the question: are deep learning models capable of learning the underlying target class distribution? We focus on sparse sequence classification, that is, problems in which the target class is rare, and compare three deep learning sequence classification models. We develop an evaluation that measures how well a classifier is learning the target class distribution. In addition, our evaluation disentangles good performance achieved by mere compression of the training sequences from performance achieved by proper model generalization. Our results suggest that in this binary setting the deep learning models are indeed able to learn the underlying class distribution in a non-trivial manner, i.e., by proper generalization beyond data compression.
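A toy version of the distribution-learning check might compare the model's predicted positive rate against the true prevalence of the rare class; all numbers below are synthetic, and the paper's evaluation is more involved:

```python
import numpy as np

# Synthetic sanity check: does a thresholded classifier recover the rare
# target-class prevalence? Here the "model scores" are a noisy, weakly
# informative stand-in, so the predicted rate overshoots the true one.

rng = np.random.default_rng(42)
true_prevalence = 0.02                         # sparse target class
y_true = rng.random(100_000) < true_prevalence

scores = rng.random(100_000) + 0.3 * y_true    # stand-in model scores
y_pred = scores > 0.8

print(f"true prevalence:      {y_true.mean():.4f}")
print(f"predicted prevalence: {y_pred.mean():.4f}")
```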